Implement multi-site local sensitivity analysis workflow#1

Open
divine7022 wants to merge 25 commits into ccmmf:main from divine7022:local-SA

Conversation

@divine7022
Collaborator

@divine7022 divine7022 commented Nov 17, 2025

Summary:

Implements a complete multi-site local sensitivity analysis (SA) workflow for SIPNET as specified in #151. This PR delivers aggregated sensitivity results across design points, environmental gradient analysis, and a Quarto-based report summarizing parameter importance patterns.

Addresses Issue #151 Requirements

@dlebauer
Contributor

dlebauer commented Nov 21, 2025

BTW this looks really nice, well done. Though I'd like to work through the steps as I review.

Could you please add a readme with instructions on how to use the repository?

@divine7022
Collaborator Author

Thanks for reviewing; updated the readme in #2.

@infotroph infotroph left a comment

Just realized I left these comments unposted on my partial readthrough the other day, so adding now. Would it be useful for me to finish a full review now or wait for updates first?

Comment on lines 35 to 44
#----------------------------------
# read.settings() uses xmlToList, which loses duplicate <variable> tags, so read the XML directly
library(XML)
xml_doc <- XML::xmlParse(file.path("output", "pecan.CONFIGS.xml"))
sa_variables <- unique(XML::xpathSApply(
  xml_doc,
  "//sensitivity.analysis//variable",
  XML::xmlValue
))
XML::free(xml_doc)


Suggested change
-#----------------------------------
-# read.settings() uses xmlToList loses duplicate <variable> tags, so read XML directly
-library(XML)
-xml_doc <- XML::xmlParse(file.path("output", "pecan.CONFIGS.xml"))
-sa_variables <- unique(XML::xpathSApply(
-  xml_doc,
-  "//sensitivity.analysis//variable",
-  XML::xmlValue
-))
-XML::free(xml_doc)
+sa_variables <- settings$sensitivity.analysis |>
+  (\(x) x[names(x) == "variable"])() |>
+  unlist() |>
+  unique()

Collaborator Author

👍

@@ -0,0 +1,108 @@
#!/usr/bin/env Rscript


Does this site selection process differ from what David already implemented in the downscaling repo? Seems preferable to use an existing tool rather than add that complexity to the UA process.

Collaborator Author

@divine7022 divine7022 Nov 26, 2025

I have refactored this script to completely remove the clustering logic. Instead, it now acts purely as a consumer of the shared design_points.csv artifact (currently David's manually selected 198 sites) to perform the necessary PFT mapping and sub-sampling for UA.

We will treat this as the interface for now, and can revisit or refactor the integration logic once the broader coordination strategy between the Downscaling and Uncertainty workflows is finalized.

see comment #1 (comment)
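
To make the consumer pattern concrete, here is a minimal sketch of the PFT-mapping and sub-sampling step. Everything here is hypothetical: the column names (`site_id`, `pft`), the `pft_lookup` table, and the `prepare_ua_sites` helper are illustrations, not the repo's actual code.

```r
# Hypothetical sketch of consuming the shared design-points artifact.
# Column names (site_id, pft) are assumptions about the CSV schema.
library(dplyr)

prepare_ua_sites <- function(design_points, pft_lookup, n_sample = NULL) {
  sites <- design_points |>
    left_join(pft_lookup, by = "site_id") |>   # attach PFTs
    filter(!is.na(pft))                        # drop unmapped sites
  if (!is.null(n_sample)) {
    sites <- slice_sample(sites, n = min(n_sample, nrow(sites)))
  }
  sites
}

# Toy inputs standing in for design_points.csv and the PFT map:
design_points <- data.frame(site_id = 1:4, lat = c(34, 36, 38, 40))
pft_lookup <- data.frame(site_id = c(1, 2, 4), pft = c("grass", "crop", "crop"))
ua_sites <- prepare_ua_sites(design_points, pft_lookup)
```

Keeping the join and the sub-sampling in one small function makes it easy to swap in the real artifact path later without touching the callers.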

@@ -0,0 +1,497 @@
#' Aggregate local sensitivity results across sites


I'm not sure what "local" means here

Collaborator Author

Good catch regarding the ambiguity.

In this context, 'local' refers to the mathematical method used by PEcAn's variance-decomposition one-at-a-time (OAT) SA, where parameters are varied individually around their median to calculate partial derivatives (elasticities). This is distinct from the global sensitivity analysis we perform later.

I have updated the function documentation to explicitly state this aggregates one-at-a-time (OAT) parameter sensitivity results to avoid confusion with geographical locality.
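
As a toy illustration of the OAT idea (not PEcAn's actual implementation), an elasticity is the normalized slope of model output with respect to one parameter, evaluated around its median while all other parameters stay fixed:

```r
# Sketch: central-difference elasticity for one parameter.
# model_fn is a stand-in function, not SIPNET.
oat_elasticity <- function(model_fn, theta_median, delta = 0.01) {
  y_lo  <- model_fn(theta_median * (1 - delta))
  y_hi  <- model_fn(theta_median * (1 + delta))
  y_med <- model_fn(theta_median)
  slope <- (y_hi - y_lo) / (2 * delta * theta_median)  # dY/dtheta
  slope * theta_median / y_med  # % change in output per % change in parameter
}

# For y = theta^2 the elasticity is exactly 2, regardless of the median:
oat_elasticity(function(theta) theta^2, theta_median = 3)
```

Because the elasticity is dimensionless, it lets parameters with very different units be ranked on one scale.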

#' Aggregate local sensitivity results across sites
#'
#' @param sensitivity_outdir Directory containing PEcAn sensitivity outputs
#' @param design_points Data frame of site metadata


I recommend documenting what columns are expected in this df, and whether unexpected columns are ignored or cause errors

Collaborator Author

👍 documented

)

# Process all files
all_results <- purrr::map_dfr(sa_files, function(sa_file) {


This will be much easier to read and understand if you define it as a function with an informative name -- yes, even if it's only called once as all_results <- purrr::map_dfr(sa_files, name_of_function)

Collaborator Author

I have extracted the file processing logic into a standalone helper function, process_sa_file.
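
The pattern being suggested looks roughly like this; the body of `process_sa_file` below is a placeholder to show the shape, not the actual implementation:

```r
# A named helper makes the map call self-documenting:
# all_results <- purrr::map_dfr(sa_files, process_sa_file)
process_sa_file <- function(sa_file) {
  sens <- readRDS(sa_file)  # assumed serialized sensitivity output
  data.frame(file = basename(sa_file), n_elements = length(sens))
}

# Demonstrate on a throwaway RDS file:
tmp <- tempfile(fileext = ".rds")
saveRDS(list(slope = 0.4, elasticity = 1.2), tmp)
res <- process_sa_file(tmp)
```

The map call then reads as a one-line statement of intent, and the helper can be unit-tested on a single file in isolation.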

@divine7022
Collaborator Author

Would it be useful for me to finish a full review now or wait for updates first?

Thanks for the comments, @infotroph. Let me address these first; you can continue the full review once I push the next update.

Comment on lines +7 to +15
# Currently, this script consumes the 'design_points.csv' from the shared
# directory. At this stage of the project, these are MANUALLY SELECTED
# points (198 sites), not yet the output of the automated clustering
# workflow.
#
# TODO: Once the integration architecture between the Downscaling and
# Uncertainty repos is finalized, refactor this script so that it continues
# to consume the artifact generated by that pipeline.
# =======================================================================
Collaborator Author

.

Contributor

@dlebauer dlebauer left a comment

Overall, this is very nice work. The code is clear and easy to understand.

I've read through the scripts and R functions. My next step will be to review the report. I've made a few comments. Not all need to be done now, but please write tickets or todos to capture future work.

pecan_xml_template: "data_raw/template.xml"
sites:
  design_points_file: "data_raw/sa_design_points.csv"
sensitivity:
Contributor

The goal with this file was to handle settings that aren't in the pecan xml. Is there a reason that these are included here rather than the template.xml?

Minimizing config options here, and providing sensible defaults in the template.xml could make it more clear to end users.

@@ -8,8 +8,19 @@ default:
  raw_data_dir: "data_raw"
  cache_dir: "cache"
  pecan_outdir: "/projectnb2/dietzelab/ccmmf/modelout/ccmmf_phase_2b_mixed_pfts_20250701"
Contributor

Once we have ccmmf_dir, can we change this to use that as a variable? That way it is only necessary to change the system-specific path once. (This also applies to master_design_points, etc.)

Suggested change
-pecan_outdir: "/projectnb2/dietzelab/ccmmf/modelout/ccmmf_phase_2b_mixed_pfts_20250701"
+pecan_outdir: "$ccmmf_dir/modelout/ccmmf_phase_2b_mixed_pfts_20250701"

raw_data_dir: "data_raw"
cache_dir: "cache"
pecan_outdir: "/projectnb2/dietzelab/ccmmf/modelout/ccmmf_phase_2b_mixed_pfts_20250701"
master_design_points: "/projectnb2/dietzelab/ccmmf/data/design_points.csv"
Contributor

nit: these are just design points ...

Suggested change
-master_design_points: "/projectnb2/dietzelab/ccmmf/data/design_points.csv"
+design_points: "/projectnb2/dietzelab/ccmmf/data/design_points.csv"

cfg <- config::get(file = "000-config.yml")

# Define paths based on config
master_file <- cfg$paths$master_design_points
Contributor

These may seem like trivial naming requests, but clarity and specificity are important.

  • why is this 'master_design_points' ... that implies that there are other sets of design points.
  • I would call this object design_points_csv instead of master_file
  • instead of master_data, I would call that object design_points.

)
) |>
dplyr::filter(!is.na(pft)) |>
dplyr::slice_sample(n = n_sample) |>
Contributor

why are we sampling here?

# Prepares design points for Uncertainty Analysis.
#
# Currently, this script consumes the 'design_points.csv' from the shared
# directory. At this stage of the project, these are MANUALLY SELECTED
Contributor

These were not manually selected; they came from the site selection workflow.

#!/usr/bin/env Rscript

# =======================================================================
# 001_setup_design_points.R
Contributor

The fact that this script exists makes me wonder whether the clustering workflow that generates design_points.csv should be updated - that will help ensure consistency in workflows that consume it.

raw_data_dir <- cfg$paths$raw_data_dir

# SA configuration (configurable paths)
options <- list(
Contributor

It is unclear why configs are hard-coded here if they are already in the config file and/or template.xml.

)
site_info <- site_info |> dplyr::rename(id = site_id)

# Read template settings
Contributor

Consider: what is the minimum amount of information that we want the end users to be able to configure, and why?

In addition to file paths, I can see n_sample, start_date, end_date being useful for being able to trigger 'development' mode.

dir.create(settings$outdir, recursive = TRUE)
}

# Handle workflow resumption
Contributor

is it worth the effort and overhead to manage a STATUS file? My recollection is that the STATUS file was developed primarily for the PHP web interface to display. It seems that dropping the status file would simplify the code a lot. What benefit does it provide, if any?

This is a context where targets might be useful.
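
For reference, a targets pipeline expresses the workflow as a dependency graph and skips any step whose inputs have not changed, which would replace manual STATUS bookkeeping entirely. A sketch of what a _targets.R might look like (the step names and the run_sensitivity / render_report helpers are hypothetical):

```r
# _targets.R sketch; `targets` rebuilds a step only when its inputs change.
library(targets)
list(
  tar_target(design_points_file, "data_raw/sa_design_points.csv",
             format = "file"),                          # tracked input file
  tar_target(design_points, read.csv(design_points_file)),
  tar_target(sa_results, run_sensitivity(design_points)),  # hypothetical helper
  tar_target(report, render_report(sa_results))            # hypothetical helper
)
```

Running tar_make() would then resume from whichever target is out of date, with no explicit status file.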

Contributor

@dlebauer dlebauer left a comment

Overall the report is a nice analysis. A few changes could make the results easier to interpret.

  • Read and implement "turning tables into graphs".
  • Lead with a 3–5 bullet summary of the top findings (which parameters are most important to calibrate, and how these vary with environmental drivers).
  • Display all of the elasticity plots at once using faceting.
  • Show variance across sites in the elasticity plot.
  • Show sites on a map similar to the one showing the design points in the downscaling repo.
  • Replace the site table with a rank-distribution plot (median + IQR) and link to a downloadable CSV.
  • Make gradient results a small multiple of partial-dependence plots for the top 3 parameters, rather than a long table.
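
The rank-distribution summary could be assembled along these lines (a sketch: the `parameter` and `elasticity` column names are assumptions about the aggregated results, not the actual schema):

```r
# Sketch: per-parameter median and IQR of |elasticity| across sites,
# suitable for a dot-and-interval plot. Column names are assumed.
rank_distribution <- function(results) {
  groups <- split(abs(results$elasticity), results$parameter)
  stats <- t(vapply(groups, function(v) {
    c(median = median(v),
      q25 = unname(quantile(v, 0.25)),
      q75 = unname(quantile(v, 0.75)))
  }, numeric(3)))
  out <- data.frame(parameter = rownames(stats), stats, row.names = NULL)
  out[order(-out$median), ]  # most sensitive parameters first
}

# Toy data in the assumed shape:
toy <- data.frame(
  parameter = rep(c("SLA", "growth_resp"), each = 4),
  elasticity = c(0.9, 1.1, 0.8, 1.2, 0.1, -0.2, 0.15, 0.05)
)
ranks <- rank_distribution(toy)
```

The resulting data frame maps directly onto a point-plus-interval plot (median as the point, q25–q75 as the bar), with the CSV itself linked for download.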


Parameter rankings by mean absolute elasticity identify which inputs most strongly influence model outputs, averaged across all environmental conditions. High elasticity indicates the model is highly sensitive to that parameter.

::: panel-tabset
Contributor

It would be helpful to be able to view all of these plots at the same time, either as panels or different color bars.

By default, I prefer to show medians, especially for non-normally distributed values. If they exist, consider showing the distribution of values (e.g., across sites)


------------------------------------------------------------------------

# Site-by-Site Sensitivity {#sec-site-sensitivity}
Contributor

The site-by-site table is difficult to interpret and doesn't add much once across-site variance is shown in the elasticity plot, so let's remove it. Andrew Gelman's "Let's practice what we preach: turning tables into graphs" makes a good case for visual summaries when tables get this dense: https://sites.stat.columbia.edu/gelman/research/published/dodhia.pdf

(The reason that I put a dense table of county aggregated means in the downscaling report is that those values are the numbers that the client wants)

A map would be a nice way to show the distribution of sites. It could eventually (not required for MVP) be interesting to map elasticity of the first, say, five parameters.

A few small usability notes on this style of table:

  • When rendered as HTML, the table scrolls horizontally but the headers don't stay aligned.
  • Two significant figures would likely be enough; additional precision is noise.
  • The cell color bars are helpful for relative magnitude.

Contributor

@dlebauer dlebauer Feb 22, 2026

Consider something like a heat map. It isn't necessary to show every site; perhaps summarize by environmental cluster?

See Dietze et al 2017 https://agupubs.onlinelibrary.wiley.com/doi/10.1002/2013JG002392

[heat-map figures from the cited paper]


# Technical Details {#sec-technical}

## Sensitivity Metrics
Contributor

Link to this section from above, or perhaps move it to the beginning.

- **Negative slope**: Sensitivity *decreases* along gradient
- **High** $R^2$: Strong environmental control on parameter importance
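
These two metrics fall directly out of a per-parameter linear fit of elasticity against the gradient; a minimal illustration with exactly linear toy data:

```r
# Toy gradient fit: the slope and R^2 summarize environmental control.
gradient <- 1:10
elasticity <- 5 - 2 * gradient          # exactly linear, decreasing
fit <- lm(elasticity ~ gradient)
slope <- unname(coef(fit)["gradient"])  # negative: sensitivity decreases
r2 <- summary(fit)$r.squared            # 1 here: perfect environmental control
```

In real data the fit will be noisy, so the slope's sign and the R^2 should be read together rather than in isolation.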

## Variance Explained Trends
Contributor

It's not clear why there are only two unique values of $R^2$ in the table.

:::

## Summary Statistics

Contributor

this plot has two rows with the same values

[screenshot of the plot]


Regression analysis tests whether parameter sensitivity systematically varies with environmental conditions, identifying context-dependent calibration needs.

## Regression Results
Contributor

One way these could be displayed is using facets:

  • each row a different parameter (for the top parameters / greatest slopes)
  • each column a different environmental covariate
  • each plot with x = env covariate and y = elasticity. Color could be used to plot multiple outputs (SOC, NPP, etc.) on the same plot, or separate plots created for each output.

Another note on plots: by default, keep the axes (and other scales) the same even across different variations of a plot (e.g. all elasticity plots should share the same range); this makes it easier to compare. Sometimes it is necessary to break this rule, e.g. to show interesting trends that would otherwise be obscured.
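
A sketch of that faceted layout with ggplot2. The column names here (env_value, env_var, parameter, output, elasticity) are assumptions about a tidied results table, not the repo's actual columns:

```r
# Sketch: rows = parameter, columns = environmental covariate,
# colour = model output; fixed scales keep panels comparable.
library(ggplot2)

plot_gradients <- function(df) {
  ggplot(df, aes(x = env_value, y = elasticity, colour = output)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE) +
    facet_grid(parameter ~ env_var, scales = "fixed") +
    labs(x = "Environmental covariate", y = "Elasticity (signed)")
}

# Toy data in the assumed shape:
toy <- expand.grid(parameter = c("SLA", "psnTOpt"),
                   env_var = c("MAT", "MAP"),
                   env_value = 1:5,
                   output = c("NPP", "SOC"))
toy$elasticity <- seq_len(nrow(toy)) / nrow(toy)
p <- plot_gradients(toy)
```

facet_grid with scales = "fixed" is the default, but stating it makes the comparability intent explicit.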

parameter_rankings <- aggregated_results |>
  dplyr::group_by(parameter, response_var) |>
  dplyr::summarize(
    # Mean absolute elasticity (primary metric for sensitivity)
Contributor

Taking the absolute value of elasticity makes sense when ranking parameter importance, but for plotting and regression we should use the untransformed elasticity.

There are lots of ways to plot this information, but this one is from LeBauer et al. 2013 (shown here mostly to note that the direction of elasticity is also important information).

[figure from LeBauer et al. 2013]

@dlebauer
Contributor

dlebauer commented Feb 22, 2026

Additional items for the "TODO" list

  • split diagnostics into a separate doc from the sensitivity results; the report should focus on sensitivity metrics that are useful for model development and parameterization
  • add plots, for each output variable, of parameter value vs. output (i.e. continuous sensitivity; these should be automatically generated by PEcAn)
  • consider grouping sites by cluster (from the clustering workflow)
  • consider a hierarchical Bayesian model using brms::brm with outputs, parameters, and sites as random effects
  • test the assumptions of the regression model
  • do not report p-values. They are okay for filtering but not hypothesis testing, and the p-values here would need to be corrected for the fact that we are running lots of tests (a simple correction is p × the number of times lm is called)
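
The simple correction in the last bullet is a Bonferroni adjustment, which base R provides:

```r
# Bonferroni: multiply each p-value by the number of tests (capped at 1).
p_raw <- c(0.01, 0.04, 0.20)
p_adj <- p.adjust(p_raw, method = "bonferroni")  # n defaults to length(p_raw)
```

p.adjust also offers less conservative alternatives (e.g. method = "holm", its default, or "BH" for false-discovery-rate control), which may be more appropriate when many regressions are run.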
